
\documentclass{acm_proc_article-sp}
%\usepackage{setspace}
%\setstretch{2} 
%\documentclass[10pt,twocolumn]{IEEEtran}
%\bibliographystyle{IEEEtran}
%\usepackage{graphicx}
%\usepackage{pstricks}
%\usepackage{egameps}
%\setlength{\parindent}{0in}
%\setlength{\parskip}{12pt plus 1pt minus 1pt}

%\linespread{1.6}
%\setlength{\columnsep}{0.6cm}
%\usepackage[hmargin=0.9in,vmargin=0.9in]{geometry}\begin{document}

\title{Regulatory Monitoring of Energy Usage using Data Mining Techniques\titlenote{Paper for course ECE1770: Green Middleware}}
\subtitle{[Exploring Information Technology for Green Solutions]}

\numberofauthors{1} %  in this sample file, there are a *total*
\author{
\alignauthor
Patricia Hon\titlenote{Master of Engineering Student}\\
       \affaddr{University of Toronto}\\
       \affaddr{Electrical and Computer Engineering Department}\\
       \affaddr{Toronto, Canada}\\
       \email{patricia.hon@utoronto.ca}
}
\date{30 April 2010}

\begin{document}
\maketitle


\begin{abstract}
An energy usage analysis tool is presented. This tool, called EPAnalyzer, uses property assessment information and energy usage information to compare, analyze, and classify energy usage profiles into different usage types. The data is analyzed to determine whether the property is being used as it was zoned and whether it meets the regulatory requirements for its use. This is achieved by applying data mining techniques to existing usage profiles. We leverage the Weka Project~\cite{weka} to apply machine learning algorithms to data gathered from IESO~\cite{Operator2010}. EPAnalyzer uses several techniques to predict the energy usage type of test energy usage patterns. The tool is tested and found to be reliable in most cases at determining the correct building energy usage types and subtypes. In most cases, it is more reliable when given the month in which the profile was collected. Finally, as a static analysis tool, its performance is acceptable; the main limitation is the time required to generate models using data mining tools. As an extension, analysis of the results can be applied to determine whether the property can apply for energy saving incentives.
\end{abstract}

% A category with the (minimum) three required fields
\category{H.2.8}{Database Applications}{Data mining} \\
%A category including the fourth, optional field follows...
\category{I.6.4}{Simulation and Modeling}{Model Validation and Analysis}

\terms{Theory}
\keywords{ECE1770, Green Middleware, Data Mining, Machine Learning, Logistic Regression, Weka, Energy Usage Profile} % NOT required for Proceedings

\section{Introduction}
New regulations are being put in place for energy use, but a system is needed to monitor whether those regulations are met. Additionally, such a system can be used to discover illegal activities in buildings zoned as residential or vacant. Regulatory monitoring is important because it can increase an industry's ability to optimize integrated energy efficient design.

This work requires using existing simulation software to generate data, as well as research into building usage types and their energy usage patterns. Property assessment information and building attributes are also examined to determine appropriate input parameters.

This work also requires applying data mining techniques on existing data and generating models. Test data is used to validate the models.

\section{Background}
Property assessment holds information which can be applied for regulatory monitoring. Information available from property assessment companies such as MPAC~\cite{mpac} is as follows: property location, lot dimensions, living area, age of the property, information about renovations or additions, quality of construction, and property use. It also contains other feature information such as number of bathrooms, fireplaces, garages, pools, whether properties have water frontage, and so on. MPAC also maintains enumeration information used to prepare the Preliminary List of Electors, so occupant information is also collected: age, gender, and ownership status. Other information can be extrapolated from comparable properties in the community to determine an assessment. This vast supply of information would be invaluable for building models of energy usage to verify that buildings are being used as they are zoned to be used. The information is also useful for building models of neighbourhoods to characterize energy usage.

Data mining is a multi-disciplinary field which combines statistics, machine learning, artificial intelligence, and database technology. It is used to explain past data and predict the future by exploring and analyzing data~\cite{Sayad,Piatetsky-Shapiro}. Data mining methods are used in diverse profiling practices such as marketing, surveillance, fraud detection, and scientific discovery~\cite{wiki:datamining}. Machine learning is the discipline concerned with creating algorithms that allow computers to develop behaviours based on empirical data. Such algorithms must automatically detect patterns given all sorts of inputs~\cite{wiki:machinelearning}. Machine learning can be further divided into subfields by algorithm type. In this paper we use supervised learning. Supervised learning algorithms generate functions which map inputs into classes of desired outputs~\cite{Rouhani-Kalleh2006}. The Logistic Regression model predicts the probability of occurrence of an event by fitting data to a logistic curve~\cite{wiki:logisticregression}.
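
As a brief illustration of the logistic curve mentioned above, consider the binary case. The symbols used here ($x_i$ for the input attributes, $\beta_0$ and $\beta_i$ for the fitted intercept and coefficients, and $z$ for their linear combination) are notation introduced for exposition and are not taken from the cited sources:
\[
z = \beta_0 + \sum_{i=1}^{n} \beta_i x_i, \qquad
p = \frac{1}{1 + e^{-z}} .
\]
The value $p$ always lies strictly between 0 and 1, so it can be read as a probability: $z = 0$ gives $p = 0.5$, large positive $z$ gives $p$ close to 1, and large negative $z$ gives $p$ close to 0.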

\section{Approach}
For this tool we focus on analyzing energy profile information to determine building usage types. Information on the implementation is described in Section~\ref{sec:implementation}. The building types are grouped into industries. Commercial, Industrial, and Residential are the major usage types. These are further broken down into subtypes. Types and subtypes are described in Section~\ref{sec:usagetypes}. Logically, this paper follows a design, implementation, and evaluation methodology. Input data being analyzed is the energy profile and information on the month and the day type when the profile was gathered. Further details of input data are explained in Section~\ref{sec:inputdata}.

\section{Implementation}
\label{sec:implementation}
This section describes the implementation details of the EPAnalyzer tool.

\subsection{High-Level Design}
EPAnalyzer is a static analysis tool which leverages knowledge of property assessment information and actual usage patterns to determine whether an observed usage pattern matches the expected energy usage pattern. The tool is implemented in Java. It uses data collected from IESO on typical usage patterns for different subtypes of Commercial, Residential, and Industrial buildings as the library of actual data. We apply data mining techniques to model the data. The models were created using existing data mining software~\cite{weka}.

\subsection{Usage Types}
\label{sec:usagetypes}
For analysis purposes, there are three main building types: Commercial, Industrial, and Residential. Each of these types is further divided into subtypes. Commercial usage is divided into 11 segments or usage subtypes, listed in Table~\ref{commercialtypes}. There are 14 different Industrial segments or subtypes, listed in Table~\ref{industrialtypes}.

For commercial buildings, the tool can also analyze the expected building type from building area and electricity intensity per square metre for different building subtypes. Data was collected from IESO reports~\cite{Consulting2005} and is listed in Table~\ref{elecintensity}.
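
To make the intensity comparison concrete, the quantity can be written as follows. The symbols $I$, $E$, and $A$ are notation introduced here, and the sample building is hypothetical:
\[
I = \frac{E}{A} ,
\]
where $E$ is the annual electricity usage in kWh, $A$ is the floor area in square metres, and $I$ is the intensity in kWh per square metre. If the area is supplied in square feet, it can first be converted using $1~\mathrm{ft}^2 = 0.09290304~\mathrm{m}^2$. For example, a hypothetical building of 10\,000~ft$^2$ (about 929~m$^2$) that uses 318\,700~kWh per year has $I \approx 343$, which matches the Offices value in Table~\ref{elecintensity}.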

Residential properties have different patterns based on whether heating or cooling systems are installed and which types of systems are used. The tool takes the input load profiles and applies them to the models built from expected load profiles for specific heating and cooling implementations. The actual usage type can then be determined from the probability that the profile belongs to each usage type. This extra matching can serve as a cross-reference to determine whether input data about heating and cooling systems is incorrect or outdated. There are 10 different Residential subtypes. Table~\ref{residentialtypes} lists the Residential subtypes created for matching energy profiles in EPAnalyzer. Table~\ref{residentiallegend} describes the symbols used.

\begin{table}
\centering	
\caption{Commercial Energy Usage subtypes}
\label{commercialtypes}
\begin{tabular}{|l|} \hline
Commercial Subtypes \\ \hline 
Food Stores \\ 
Other Retail Stores \\ 
Services \\ 
Wholesale Warehouses \\ 
Offices \\ 
Health Facilities \\ 
Education \\ 
Hotels and Other Accommodation \\ 
Recreational Facilities \\ 
Religious Institutions \\ 
Multi-residential \\
\hline\end{tabular}
\end{table}

\begin{table}
\centering	
\caption{Industrial Energy Usage subtypes}
\label{industrialtypes}
\begin{tabular}{|l|} \hline
Industrial Subtypes \\ \hline 
Farms, Forestry and Fishing \\ 
Mines and Quarries \\ 
Food Manufacturing Plants \\ 
Clothing Manufacturing Plants \\ 
Wood Products Manufacturing \\ 
Primary Metals \\ 
Fabricated Metals \\ 
Machinery Industry \\ 
Transportation Industry \\ 
Electrical and Electronic \\ 
Non-Metallic Minerals \\ 
Petroleum and Chemicals \\ 
Other Manufacturing \\ 
Construction \\
\hline\end{tabular}
\end{table}

\begin{table}
\centering	
\caption{Residential Energy Usage Combinations}
\label{residentialtypes}
\begin{tabular}{|c|c|c|} \hline
BaseLoad & Heating & Cooling \\ \hline 
BL & x & x \\ 
BL & x & AC \\ 
BL & BB & x \\ 
BL & BB & AC \\ 
BL & EF & x \\ 
BL & EF & AC \\ 
BL & SH & x \\ 
BL & SH & AC \\ 
BL & WH & x \\ 
BL & WH & AC \\ 
\hline\end{tabular}
\end{table}

\begin{table}
\centering	
\caption{Residential Energy Usage Legend}
\label{residentiallegend}
\begin{tabular}{|l|} \hline
Legend \\ \hline 
x = none \\ 
BL = Baseload \\ 
AC = Central Air \\ 
SH = Space Heating \\ 
EF = Electric Furnace \\ 
BB = Base Board \\ 
WH = Water Heating \\ 
\hline\end{tabular}
\end{table}

\begin{table}
\centering
\caption{Electricity Intensity by Sub-Sector (kWh/sq. m.) in 2003}
\label{elecintensity}
\begin{tabular}{|l|l|} \hline
Sub-Sector & kWh/sq. m \\ \hline 
Wholesale Trade & 218 \\ 
Retail Trade & 229 \\ 
Transportation \& Warehousing & 148 \\ 
Information \& Cultural Industries & 110 \\ 
Offices & 343 \\ 
Education & 68 \\ 
Health Care \& Social Assistance & 225 \\ 
Arts, Entertainment \& Recreation & 177 \\ 
Accommodation \& Food Services & 490 \\ 
Other Services & 96 \\
\hline\end{tabular}
\end{table}

\subsection{Input Data}
\label{sec:inputdata}
The input data to EPAnalyzer consists of: building type, building subtype, square footage, annual energy usage, and a load profile. Additionally, information on the date the load profile was collected can be included to refine the analysis.

\subsection{Data Cleansing}
\label{sec:datainitializer}
The EPAnalyzer DataInitializer class processes and cleanses IESO energy demand data~\cite{IESOdemand} and prepares it for processing by the Weka data mining tool~\cite{weka}. As input to this process, we use published reports from IESO for specific industry types and subtypes. These data files provide typical load shapes for two day types (weekend and weekday) and for 12 months. The load shapes are normalized to 1 MW, or 8760 MWh on an annual level. The cleansing process removes excess information. The load profiles are also normalized for each day.
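
The normalization figures above can be checked with a short calculation. The annual figure follows directly from the number of hours in a year:
\[
1~\mathrm{MW} \times (365 \times 24)~\mathrm{h} = 8760~\mathrm{MWh} .
\]
The per-day normalization scheme is not spelled out here. One plausible reading, given purely as an assumption, divides each hourly value $x_h$ of a 24-hour profile by the daily total (the symbols $x_h$ and $\tilde{x}_h$ are notation introduced for illustration):
\[
\tilde{x}_h = \frac{x_h}{\sum_{k=1}^{24} x_k}, \qquad h = 1, \ldots, 24 .
\]
Under this assumed scheme every daily profile sums to 1, so only the shape of the profile, not its magnitude, is passed to the classifier.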

The Residential data is provided in groups of end use: baseload, and individual heating and cooling implementations. The EPAnalyzer generates typical usage profiles for Residential buildings with all the different combinations of heating and cooling implementations. These Residential types are listed in Table~\ref{residentialtypes}. The final product of the EPAnalyzer DataInitializer is a set of comma separated value (CSV) files combining all the usage types and subtypes for each month and day type along with their load profiles. The CSV files are used as input to the Weka~\cite{weka} data mining software.

\subsection{Data Mining}
The resulting file with amalgamated information from the EPAnalyzer DataInitializer is then used as input to the Weka~\cite{weka} GUI. Logistic Regression~\cite{Rouhani-Kalleh2006} is used to build models from the data, with a different data subset for each classifier model. We created models for classifying between the main energy usage types of Commercial, Industrial, and Residential, models for classifying subtypes across all three types, and models for classifying subtypes when the specific type group is known. Each of these was generated in two variants: one where the month is provided and one where it is not. The longest running models were the classifiers that determine energy usage subtypes across all Commercial, Industrial, and Residential types from day type, energy profile, and optionally month. Table~\ref{classifiermodels} summarizes the different models created.

The classifier model for all energy usage subtypes required three days to complete due to the number of classes. Models for classifying subtypes when the specific type group is assumed ran much faster, on the order of hours or seconds. The timings were collected on a standard home computer with the following specifications: Intel Core2 Duo CPU at 2.8 GHz and 4 GB of RAM. Model generation ran on one processor.
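
A rough parameter count gives a partial sense of why the all-subtypes model is the slowest to build. The formula below assumes a standard multinomial logistic regression with $K$ classes and $d$ numeric input attributes. The symbols $K$, $d$, and $N$, and the sample value of $d$, are assumptions for illustration rather than figures reported by the tool:
\[
K = 11 + 14 + 10 = 35, \qquad N = (K-1)(d+1) .
\]
For instance, if each instance had $d = 26$ attributes (24 hourly values plus day type and month, each treated as a single numeric attribute), the all-subtypes model would fit $34 \times 27 = 918$ parameters, compared with $2 \times 27 = 54$ for the three-way type classifier. The parameter count alone does not determine the running time, which also depends on the optimizer and the number of training instances.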


\begin{table}
\centering
\caption{Energy Usage Classifier Models}
\label{classifiermodels}
\begin{tabular}{|r|l|l|l|} \hline
 & \multicolumn{3}{c|}{Attributes} \\ \hline
Classified by & Energy Profile & Day type & Months \\ \hline
All Types & Y & Y & Y \\
All Subtypes & Y & Y & Y \\
All Types & Y & Y & N \\
All Subtypes & Y & Y & N \\
Commercial Subtypes & Y & Y & Y \\
Industrial Subtypes & Y & Y & Y \\
Residential Subtypes & Y & Y & Y \\
Commercial Subtypes & Y & Y & N \\
Industrial Subtypes & Y & Y & N \\
Residential Subtypes & Y & Y & N \\
\hline\end{tabular}
\end{table}

\subsection{Data Analysis}
The EPAnalyzer EnergyProfiler class reads input CSV files with the following information: id, type, subtype, area, annual energy, month, day type, and hourly usage. Table~\ref{inputheaders} summarizes the input headers. EnergyProfiler also reads in all the classifier models and calls the DataMatcher class to determine the probability that the input profile matches each of the given subtypes.

\begin{table}
\centering
\caption{Input Headers}
\label{inputheaders}
\begin{tabular}{|l|} \hline
Header \\ \hline 
Id \\ 
Industry (type) \\ 
Usage (subtype) \\ 
Area \\ 
Annual Energy \\ 
Month \\ 
Day Type \\ 
Hourly Usage \\
\hline\end{tabular}
\end{table}

The EPAnalyzer DataMatcher class computes probabilities for each input profile. These are calculated by applying the logistic function to the regression coefficients and intercepts from the classifier models. Details on how to perform the calculations are described in another paper~\cite{Rouhani-Kalleh2006}. The probabilities for all the possible usage types for each model are calculated and then sorted. For analysis purposes, the top three probabilities are returned. This number can be adjusted to allow for additional fine-tuning and analysis. Probabilities based on energy intensity are determined from the percentage error between the expected energy intensity and the calculated value.
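
The probability calculation can be sketched as follows, assuming the multinomial logistic form documented for Weka's \texttt{Logistic} classifier, in which the last of the $K$ classes acts as the reference class. The symbols $\mathbf{x}$ (input attribute vector), $b_{j0}$ (intercept), and $\mathbf{b}_j$ (coefficient vector for class $j$) are notation introduced here and may not match the tool's internal naming:
\[
z_j = b_{j0} + \mathbf{b}_j \cdot \mathbf{x}, \qquad j = 1, \ldots, K-1 ,
\]
\[
P_j = \frac{e^{z_j}}{1 + \sum_{k=1}^{K-1} e^{z_k}}, \qquad
P_K = \frac{1}{1 + \sum_{k=1}^{K-1} e^{z_k}} .
\]
The $P_j$ sum to 1, so sorting them in descending order and keeping the first three gives the ranking described above.

For the energy intensity comparison, the percentage error between the calculated intensity $I_{\mathrm{calc}}$ and the expected intensity $I_{\mathrm{exp}}$ can be written as
\[
\mathrm{error} = \frac{|I_{\mathrm{calc}} - I_{\mathrm{exp}}|}{I_{\mathrm{exp}}} \times 100\% .
\]
As a hypothetical example, a calculated intensity of 300~kWh per square metre against the expected 343 for Offices in Table~\ref{elecintensity} gives $43/343 \approx 12.5\%$. How this error is mapped onto a match probability is not specified here.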

A match means that the property information provided is up to date and the building is using electricity as is typical for its type and sub-type.  A mismatch can mean that the property information is out of date, or the property is being used for illegal or unauthorized activities. Furthermore, a mismatch can provide information on how energy efficient the building is and on how it could improve to be on par with typical usage. These buildings could benefit from energy efficiency incentive programs offered by the government such as Every Kilowatt Counts\cite{Authority2010}.


\subsection{Evaluation}
The EPAnalyzer tool was tested for its ability to detect building usage types and sub-types. It was given type of day information: weekend or weekday. The test cases also varied whether the month of the profile was provided. Building area and annual energy usage were also used to test the matching ability of the energy intensity calculations. 

Simulated annual energy usage data was created using energy usage simulations for residential and non-residential~\cite{CanmetENERGY2010} building types and data from IESO reports~\cite{Consulting2005}. For input, we used information for each building type: Commercial, Industrial, and Residential. Several profiles for each subtype were used. These were profiles for typical January and July weekdays and weekends.

Real-world data was taken from the Queen's University Live Building Project~\cite{QueensLiveBuilding}. Data was collected for profiles of typical January and July weekdays and weekends. The building examined was Ban Righ Hall, which is a residence. Table~\ref{queenscases} summarizes the Live Building test cases.

We tested the ability to differentiate between the building usage types: Commercial, Industrial, and Residential. This was also tested when the month is unknown. The evaluation tests the ability to determine building subtypes when an industry usage type is assumed, and when the industry usage type is unknown. Both of these tests are also run for the situation where the month is unknown.

\begin{table}
\centering
\caption{Queen's University Live Building Test Cases}
\label{queenscases}
\begin{tabular}{|l|l|l|l|} \hline
Case & Building & Month & Day \\\hline
1 & Ban Righ Hall & January & Weekday  \\
2 & Ban Righ Hall & July & Weekday  \\
3 & Ban Righ Hall & January & Weekend \\
4 & Ban Righ Hall & July & Weekend \\
\hline\end{tabular}
\end{table}

\subsection{Results}
Evaluation results are described in the sections below, based on the ability to detect usage types and based on whether there is known or unknown month information. 

\subsubsection{Live Building Test Cases}
The results from the Queen's University Live Building test cases are as follows. Table~\ref{queenscases1} lists the results when the profile is matched with a building type (Commercial, Industrial, or Residential) and the month is known. Table~\ref{queenscases2} lists the corresponding probabilities when the month is unknown.

The EPAnalyzer is not able to determine if the building is of Commercial type and Education subtype. This could be due to the atypical nature of the university buildings. Ban Righ Hall holds a residence and a banquet hall as well. Because the model groups many different types of Commercial buildings for the analysis, these outliers are difficult to identify. By using square meter area and annual energy usage, the tool determined that the building was of Commercial type and Multi-Residential subtype. This is a closer match to Ban Righ Hall.

\begin{table}
\centering
\caption{Queen's University Live Building Results: Building type where month is known}
\label{queenscases1}
\begin{tabular}{|l|l|l|} \hline
Case & Type & Probability \\ \hline
1 & Industrial & 0.646240217 \\
2 & Residential & 0.997717129 \\
3 & Residential & 0.99961842 \\
4 & Residential & 0.999999999 \\
\hline\end{tabular}
\end{table}


\begin{table}
\centering
\caption{Queen's University Live Building Results: Building type where month is unknown}
\label{queenscases2}
\begin{tabular}{|l|l|l|} \hline
Case & Type & Probability \\ \hline
1 & Industrial & 0.621235008 \\
2 & Residential & 0.811640945 \\
3 & Residential & 0.995711878 \\
4 & Residential & 1 \\
\hline\end{tabular}
\end{table}

\subsubsection{Residential Test Cases}
\label{sec:restestcases}

The residential test data follows the same pattern as above. Each subtype is tested four times: once for each combination of month (January or July) and day type (weekday or weekend). Table~\ref{testcasepattern} shows the general pattern.

\begin{table}
\centering
\caption{Generic Test Case Pattern}
\label{testcasepattern}
\begin{tabular}{|l|l|l|} \hline
Case & Month & Day \\\hline
1 & January & Weekday  \\
2 & July & Weekday  \\
3 & January & Weekend \\
4 & July & Weekend \\
\hline\end{tabular}
\end{table}

The results from the Residential test cases are as follows. For the trial that determines building type, two of the cases were incorrectly matched to the Commercial type when month was known. Two more cases were incorrectly identified as Commercial when month was unknown. It is clear that the extra information provided from the month the profile was collected can reduce the inaccurate predictions by half. 
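
These counts can be read directly from columns TM and TN of Table~\ref{rescases1}, which covers 40 Residential test cases. Expressed as error rates:
\[
\frac{2}{40} = 5\% \quad \mbox{(month known)}, \qquad
\frac{4}{40} = 10\% \quad \mbox{(month unknown)} .
\]
The misclassified cases are 23 and 39 when the month is known, and 9, 23, 25, and 39 when it is unknown, so the type-level accuracy is 95\% and 90\% respectively.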

Table~\ref{rescases1} summarizes the results of building type matching in columns TM and TN. Column TM is the trial where type is determined when the month is known. Column TN is the trial where type is determined when the month is unknown. The letters C, I, and R represent Commercial, Industrial, and Residential respectively. For columns A, B, C, and D, the numbers indicate where the correct subtype ranked among the probability results returned. Column A is the trial that compared only subtypes within this usage type and included months. Column B is the same trial where the month is unknown. Column C is the trial where all subtypes are compared and months are included. Column D compares all subtypes with unknown months. Abbreviations are explained in Table~\ref{residentiallegend}. A dash ``-'' indicates that the correct usage subtype did not rank in the top three.

From the number of 1s, the model is quite accurate. Several examples highlight limitations. Cases 10 and 12 have BaseLoad with BaseBoard heating in a summer month. Since heating systems are not used in the summer, it is difficult for the model to predict the subtype. Cases 14 and 16, 34 and 36, and 38 and 40 share this same problem. Similarly, testing for Central Air Conditioning is difficult in the winter months. The BaseLoad with CentralAir cases 5 and 7 are not predicted as well as cases 6 and 8, which are in the summer.

\begin{table}
\centering
\caption{Residential Test Cases: Building type matching where month is known and unknown and Building subtype matching for only residential subtypes and for all subtypes where month is known and unknown}
\label{rescases1}
\begin{tabular}{|l|l|l|l|l|l|l|l|} \hline
Case & Subtype & TM & TN & A & B & C & D \\ \hline
1 & BL & R & R & 1 & 2 & 1 & 2 \\
2 & BL & R & R & 1 & 1 & 1 & 1 \\
3 & BL & R & R & 1 & 2 & 1 & 2 \\
4 & BL & R & R & 1 & 1 & 1 & 1 \\
5 & BL\&AC & R & R & 2 & 1 & 2 & 1 \\
6 & BL\&AC & R & R & 1 & 1 & 2 & 1 \\
7 & BL\&AC & R & R & 2 & 1 & 2 & 1 \\
8 & BL\&AC & R & R & 1 & 1 & 1 & 1 \\
9 & BL\&BB & R & C & 2 & 2 & 1 & 1 \\
10 & BL\&BB & R & R & 1 & 1 & - & 1 \\
11 & BL\&BB & R & R & 1 & 2 & 3 & 2 \\
12 & BL\&BB & R & R & 1 & 1 & - & 1 \\
13 & BL\&EF & R & R & 2 & 2 & 2 & 1 \\
14 & BL\&EF & R & R & 1 & 1 & - & 1 \\
15 & BL\&EF & R & R & 1 & 2 & 2 & 2 \\
16 & BL\&EF & R & R & 1 & 1 & 3 & 1 \\
17 & BL\&SH & R & R & 1 & 1 & 2 & 1 \\
18 & BL\&SH & R & R & 1 & 1 & 3 & 1 \\
19 & BL\&SH & R & R & 1 & 2 & 2 & 2 \\
20 & BL\&SH & R & R & 1 & 1 & 2 & 1 \\
21 & BL\&WH & R & R & 1 & 1 & 2 & 1 \\
22 & BL\&WH & R & R & 1 & 1 & - & 1 \\
23 & BL\&WH & C & C & 1 & 1 & - & 1 \\
24 & BL\&WH & R & R & 1 & 1 & 2 & 1 \\
25 & BL\&AC\&BB & R & C & 1 & 1 & 1 & 2 \\
26 & BL\&AC\&BB & R & R & 2 & 1 & - & 1 \\
27 & BL\&AC\&BB & R & R & 2 & 1 & - & 1 \\
28 & BL\&AC\&BB & R & R & 1 & 2 & 2 & 2 \\
29 & BL\&AC\&EF & R & R & 1 & 1 & 1 & 2 \\
30 & BL\&AC\&EF & R & R & 1 & 1 & 3 & 1 \\
31 & BL\&AC\&EF & R & R & 2 & 1 & 1 & 1 \\
32 & BL\&AC\&EF & R & R & 1 & 1 & 3 & 1 \\
33 & BL\&AC\&SH & R & R & 1 & 2 & 1 & 2 \\
34 & BL\&AC\&SH & R & R & 1 & 2 & - & 1 \\
35 & BL\&AC\&SH & R & R & 1 & 1 & 1 & 1 \\
36 & BL\&AC\&SH & R & R & 2 & 2 & - & 2 \\
37 & BL\&AC\&WH & R & R & - & - & 1 & 2 \\
38 & BL\&AC\&WH & R & R & - & - & - & 1 \\
39 & BL\&AC\&WH & C & C & - & - & 3 & 2 \\
40 & BL\&AC\&WH & R & R & - & - & - & 1 \\
\hline\end{tabular}
\end{table}




\subsubsection{Commercial Test Cases}

The input data follows the same pattern as the Residential test data. Table~\ref{commcases1} lists the Commercial test cases. The format follows that of the Residential test case Table~\ref{rescases1}. Details are described in Section~\ref{sec:restestcases}.

Most of the building subtypes were correctly identified as Commercial type. Hotels and multi-residential buildings were mistaken for Residential on several occasions. This is probably because hotels and multi-residential buildings are heavy consumers of energy for environmental control systems and hot water production~\cite{HotelSector}, which is the same pattern seen in Residential building types.

Churches are identified as Industrial types for a winter month on a weekend. This is because religious institutions have an unusual load profile which uses more energy during the weekends and during winter months. Religious ceremonies are held on weekends and during the winter months, more of them need to be held indoors with environmental control systems running.

It is interesting to note that the column where the month is unknown is more accurate at guessing the subtype than the column where the month is known. This is probably because most Commercial buildings run all year round with similar systems. Thus the model is better able to match the general trend even without information about the month.

\begin{table}
\centering
\caption{Commercial Test Cases: Building type matching where month is known and unknown and Building subtype matching for only commercial subtypes and for all subtypes where month is known and unknown}
\label{commcases1}
\begin{tabular}{|l|l|l|l|l|l|l|l|} \hline
Case & Subtype & TM & TN & A & B & C & D \\ \hline
1 & Church & C & C & 1 & 1 & 1 & 1 \\
2 & Church & C & C & 1 & 1 & 3 & 1 \\
3 & Church & I & I & 1 & 1 & 3 & 1 \\
4 & Church & C & C & 1 & 1 & 3 & 1 \\
5 & Education & C & C & 1 & 1 & 2 & 1 \\
6 & Education & I & I & 1 & 1 & 3 & 1 \\
7 & Education & C & C & 1 & 1 & 1 & 1 \\
8 & Education & C & C & 1 & 1 & 2 & 1 \\
9 & Food & C & C & 1 & 1 & 2 & 1 \\
10 & Food & C & C & 1 & 1 & 2 & 1 \\
11 & Food & C & C & 1 & 1 & 1 & 1 \\
12 & Food & C & C & 1 & 1 & 1 & 1 \\
13 & Health & C & C & 1 & 1 & 2 & 1 \\
14 & Health & C & C & 1 & 1 & 2 & 1 \\
15 & Health & C & C & 1 & 1 & 1 & 1 \\
16 & Health & C & C & 1 & 1 & 1 & 1 \\
17 & Hotels & C & C & 1 & 1 & 1 & 1 \\
18 & Hotels & R & C & 1 & 1 & 2 & 1 \\
19 & Hotels & C & C & 1 & 1 & 1 & 1 \\
20 & Hotels & C & C & 1 & 1 & 1 & 1 \\
21 & MultiRes & R & R & 1 & 1 & 1 & 1 \\
22 & MultiRes & C & C & 1 & 1 & 1 & 1 \\
23 & MultiRes & R & R & 1 & 1 & 2 & 1 \\
24 & MultiRes & R & R & 1 & 1 & 1 & 1 \\
25 & Offices & C & C & 1 & 1 & 1 & 1 \\
26 & Offices & C & C & 1 & 1 & 1 & 1 \\
27 & Offices & C & C & 1 & 1 & 2 & 1 \\
28 & Offices & C & C & 1 & 1 & 1 & 1 \\
29 & OtherRetail & C & C & 1 & 1 & 3 & 1 \\
30 & OtherRetail & C & C & 1 & 1 & 1 & 1 \\
31 & OtherRetail & C & C & 1 & 1 & 3 & 1 \\
32 & OtherRetail & C & C & 1 & 1 & 1 & 1 \\
33 & Rec & R & R & 1 & 1 & 1 & 1 \\
34 & Rec & C & C & 1 & 1 & 1 & 1 \\
35 & Rec & C & C & 1 & 1 & 2 & 1 \\
36 & Rec & C & C & 1 & 1 & 1 & 1 \\
37 & Services & C & C & 1 & 1 & 1 & 1 \\
38 & Services & C & C & 1 & 1 & 1 & 1 \\
39 & Services & C & C & 1 & 1 & 3 & 1 \\
40 & Services & C & C & 1 & 1 & 1 & 1 \\
41 & Warehouse & I & C & - & - & 1 & 1 \\
42 & Warehouse & I & I & - & - & 1 & 1 \\
43 & Warehouse & C & C & - & - & 1 & 1 \\
44 & Warehouse & I & I & - & - & 1 & 1 \\
\hline\end{tabular}
\end{table}


\subsubsection{Industrial Test Cases}

The input data follows the same pattern as the Residential test data. Table~\ref{indcases1} lists the Industrial test cases. The format follows that of the Residential test case Table~\ref{rescases1}. Details are described in Section~\ref{sec:restestcases}.

Energy usage for Construction on winter weekends typically flattens out. The shape is very similar to a residential subtype of BaseLoad, CentralAir, and BaseBoard, or to a commercial office subtype. These appear similar even when the actual energy consumption may be greatly different, because the usage profiles are normalized before being analyzed.

Chemical, Construction, Food, NonMetallic, and Transport were mispredicted by a small margin. In each of these cases, filtering by building type first would have narrowed the choices available and improved the accuracy of the prediction.

It is interesting to note that, as with the Commercial test results, the column where the month is unknown is more accurate at guessing the subtype than the column where the month is known. This is probably for the same reason as for Commercial buildings: Industrial buildings run all year round with similar systems. Thus the model without months is better able to match the trends.

\begin{table}
\centering
\caption{Industrial Test Cases: Building type matching where month is known and unknown and Building subtype matching for only industrial subtypes and for all subtypes where month is known and unknown}
\label{indcases1}
\begin{tabular}{|l|l|l|l|l|l|l|l|} \hline
Case & Subtype & TM & TN & A & B & C & D \\ \hline
1 & Chemical & I & I & 1 & 1 & 3 & 1 \\
2 & Chemical & I & I & 1 & 1 & 3 & 1 \\
3 & Chemical & I & I & 1 & 1 & 1 & 1 \\
4 & Chemical & I & I & 1 & 1 & 1 & 1 \\
5 & Clothing & I & I & 1 & 1 & 1 & 1 \\
6 & Clothing & I & I & 1 & 1 & 1 & 1 \\
7 & Clothing & I & I & 1 & 1 & 1 & 1 \\
8 & Clothing & I & I & 1 & 1 & 1 & 1 \\
9 & Construction & I & I & 1 & 1 & 1 & 1 \\
10 & Construction & I & I & 1 & 1 & 1 & 1 \\
11 & Construction & I & I & 1 & 1 & 3 & 1 \\
12 & Construction & I & I & 1 & 1 & 1 & 1 \\
13 & Electrical & I & I & 1 & 1 & 2 & 1 \\
14 & Electrical & I & I & 1 & 1 & 2 & 1 \\
15 & Electrical & I & I & 1 & 1 & 1 & 1 \\
16 & Electrical & I & I & 1 & 1 & 1 & 1 \\
17 & FabricatedMetals & I & I & 1 & 1 & 1 & 1 \\
18 & FabricatedMetals & I & I & 1 & 1 & 1 & 1 \\
19 & FabricatedMetals & I & I & 1 & 1 & 1 & 1 \\
20 & FabricatedMetals & I & I & 1 & 1 & 1 & 1 \\
21 & Farm & R & R & 1 & 1 & 1 & 1 \\
22 & Farm & I & I & 1 & 1 & 1 & 1 \\
23 & Farm & I & I & 1 & 1 & 1 & 1 \\
24 & Farm & R & R & 1 & 1 & 1 & 1 \\
25 & Food & I & C & 1 & 1 & 2 & 1 \\
26 & Food & I & I & 1 & 1 & 2 & 1 \\
27 & Food & I & I & 1 & 1 & 1 & 1 \\
28 & Food & C & C & 1 & 1 & 2 & 1 \\
29 & Machinery & I & I & 1 & 1 & 1 & 1 \\
30 & Machinery & I & I & 1 & 1 & 1 & 1 \\
31 & Machinery & I & I & 1 & 1 & 1 & 1 \\
32 & Machinery & I & I & 1 & 1 & 1 & 1 \\
33 & Mine & C & C & 1 & 1 & - & 1 \\
34 & Mine & I & I & 1 & 1 & 1 & 1 \\
35 & Mine & I & I & 1 & 1 & 1 & 1 \\
36 & Mine & I & R & 1 & 1 & - & 1 \\
37 & NonMetallic & I & I & 1 & 1 & 1 & 1 \\
38 & NonMetallic & I & I & 1 & 1 & 1 & 1 \\
39 & NonMetallic & C & C & 1 & 1 & 3 & 1 \\
40 & NonMetallic & C & C & 1 & 1 & 3 & 1 \\
41 & OtherMfg & I & I & 1 & 1 & 1 & 1 \\
42 & OtherMfg & I & I & 1 & 1 & 1 & 1 \\
43 & OtherMfg & I & I & 1 & 1 & 1 & 1 \\
44 & OtherMfg & I & I & 1 & 1 & 1 & 1 \\
45 & PrimaryMetals & I & I & 1 & 1 & 1 & 1 \\
46 & PrimaryMetals & I & I & 1 & 1 & 1 & 1 \\
47 & PrimaryMetals & I & I & 1 & 1 & 1 & 1 \\
48 & PrimaryMetals & I & I & 1 & 1 & 1 & 1 \\
49 & Transport & I & I & 1 & 1 & 1 & 1 \\
50 & Transport & I & I & 1 & 1 & 1 & 1 \\
51 & Transport & I & I & 1 & 1 & 1 & 1 \\
52 & Transport & I & I & 1 & 1 & 2 & 1 \\
53 & Wood & I & I & - & - & - & - \\
54 & Wood & I & I & - & - & - & - \\
55 & Wood & I & I & - & - & - & - \\
56 & Wood & I & I & - & - & - & - \\
\hline\end{tabular}
\end{table}
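
As a worked check on the type-matching columns of Table~\ref{indcases1}, assume that TM and TN hold the predicted building type with the month known and unknown respectively, and that I (Industrial) is the correct label for every case. This reading of the column headers is an assumption drawn from the caption. The table then contains 6 non-I entries under TM and 8 under TN out of 56 cases:
\[
\mathrm{acc}_{TM} = \frac{56-6}{56} = \frac{50}{56} \approx 0.893,
\]
\[
\mathrm{acc}_{TN} = \frac{56-8}{56} = \frac{48}{56} \approx 0.857 .
\]
Under that reading, building type matching is slightly more accurate when the month is known, in contrast to the subtype columns discussed above.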


\subsection{Challenges}
A major task of this tool is comparing usage profiles. This challenge could be addressed by comparing subsections of profiles and determining patterns of slope and local maxima and minima. Initial phases of this project attempted such graph comparisons. They were difficult to implement because they required manual, fine-tuned analysis of profiles. A system for rating profile matches was also required to determine how well a profile fits specific patterns. A simplified heuristic would not be sufficient to classify profiles given the wide variety of usage types.
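
One way to make the slope and extrema comparison concrete is sketched here; the symbols are hypothetical and the sketch illustrates the idea rather than the project's implementation. For a normalized profile $\hat{u}_1,\dots,\hat{u}_n$, define the discrete slope
\[
s_h = \hat{u}_{h+1} - \hat{u}_h, \qquad h = 1,\dots,n-1.
\]
Reading $h$, for $2 \le h \le n-1$, is a local maximum when $s_{h-1} > 0$ and $s_h < 0$, and a local minimum when $s_{h-1} < 0$ and $s_h > 0$. Two profiles could then be compared by the positions of these points and by the signs of $s_h$ over matching subsections, which is where the manual tuning described above becomes necessary.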

Simulators include non-electricity energy sources in generated data. Therefore, those sources of energy must be manually subtracted when comparing usage for this tool. Simulators also provide only annual or monthly usage estimates. Therefore, test data for usage profiles had to be created by modifying existing data collected from IESO \cite{Operator2010,IESO2006,IESO2010} and from research papers which contain usage profiles \cite{Authority2006,Consulting2005}.

There is a large variance in residential energy demand profiles due to different types of heating and cooling systems, the occupation and lifestyle of home dwellers, and differences between summer and winter energy use. The tool had to create profiles for each combination of heating and cooling system usage. It also maintains patterns for typical weekend and weekday usage. Home dwellers with pensioner lifestyles will have weekday usage profiles which look like typical weekend profiles.

The process of simulating different types of commercial and industrial properties is complex, with many variables. The scope of the project was therefore reduced to cover only a subset of the available variables.


\subsection{Future Work}
Although the EPAnalyzer is currently in a functional state, the application does not run as a single end-to-end process; it currently works as a multi-step application. Data cleansing is performed within EPAnalyzer DataInitializer. The cleansed data is modeled using the Weka \cite{weka} GUI, and the resulting coefficients and intercepts are loaded into EPAnalyzer EnergyProfiler when it is processing test data. In future iterations, EPAnalyzer can call the Weka API to run the data mining tasks, extract the coefficients and intercepts, and call the EnergyProfiler, all in one step.
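
To make the role of the coefficients and intercepts concrete, the following sketch assumes one linear model per candidate class. The model form and the symbols $w_{c,j}$, $b_c$, and $x_j$ are assumptions for illustration, since the exact Weka model is not restated here. For a test profile with attributes $x_1,\dots,x_m$, each class $c$ receives a score
\[
f_c(x) = b_c + \sum_{j=1}^{m} w_{c,j}\, x_j ,
\]
where $b_c$ is the intercept and $w_{c,j}$ are the coefficients exported from Weka. The profile would then be assigned to the class with the largest score, $\hat{c} = \arg\max_c f_c(x)$. A single-step pipeline would obtain $b_c$ and $w_{c,j}$ through the Weka API and pass them directly to this evaluation instead of transferring them by hand.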

As an extension, further work can be done to generate energy usage profiles for residential and non-residential types of buildings to create neighbourhoods and different building type profiles. Models can be run at the neighbourhood level to determine trends in profiles. This tool can help determine whether regulations in place for energy usage per building type are being adhered to. Additional building attribute information and neighbourhood information can be used in future iterations of the project to fine-tune results. Potentially, the tool can be used to highlight violations of regulations or possible incentive programs for which the building is eligible. Currently the EPAnalyzer tool factors in the month and day type. Further enhancements can accommodate more fine-grained weather changes or environmental factors which affect energy usage profiles. Applying models which use multiple usage profiles to match specific usage types or subtypes would enhance the model and increase the rate of correct matches.

The tool can be used as a multi-step decision tree whereby the usage subtype can be determined by homing in on key factors and reducing the classifier set. This is left as work which can be explored in further iterations of the tool. There is also more opportunity to explore energy intensity for Residential and Industrial types, since this paper has only looked into Commercial energy intensities.

\subsection{Related Work}

Related work exists in creating energy usage simulations for residential [1] and non-residential [2] building types.
Monitoring systems exist on a per-building basis and are specific to the layout of each building [3]. This results in more fine-grained data and is beyond the scope of this project.

A previous paper \cite{TaherianProfiling} studied two specific environments for energy consumption: households and office spaces. That study looks more closely at human factors which affect energy usage. The authors use a human-centric approach to address the energy saving problem.

\section{Conclusions}
This paper presented a possible implementation of data mining to enforce regulatory monitoring. We leveraged existing data mining software available from the Weka Project \cite{weka} and used data available from IESO publications as the empirical information from which to run models. We presented several models which can be used for usage type matching. Limitations of the profile matching are evident when usage patterns are heavily dependent on external factors such as time of the year or day of the week. Environmental control implementations like central air conditioning or space heating are examples of systems that depend on season and day type. The normalized usage profiles make it easier to create models, but normalization removes information which could be used to determine usage types. This can be mitigated by storing the normalizing factors and reversing the process to make scalar comparisons of the data available for modeling.

\bibliographystyle{unsrt}
\bibliography{regmonitoring}
\balancecolumns
\end{document}
